Introduction

Research on the problem of multivariate outliers has focused on several areas: 
definition, identification and accommodation (Beckman & Cook, 1983). Accommodation 
includes both rejection and weighting of the outlying observations. This review examines the 
research in the area of multivariate outliers while emphasizing the problems associated with
definition and identification. 

Historical Review of the Problem 
Treatment of the problem of outliers in statistical analysis can be traced back to the 
work of Bernoulli in 1777. Bernoulli stated that not all observations have the same weight or
error, yet he questioned the practice of deleting these aberrant observations completely. 
Chauvenet proposed a method for detection of "gross errors" as early as 1850 (Dixon, 1951, 
p. 68). Throughout the nineteenth century work was done on the identification and rejection 
of outliers in the univariate case by Peirce, Gould, Glaisher, Edgeworth, Stone, and 
Newcomb (Stone, 1873; Newcomb, 1886; Beckman & Cook, 1983). Glaisher and Newcomb 
used a weighted least squares method. As of 1950 Dixon wrote that there had been no 
success in the development of a criterion for discovery of outliers by means of a general 
statistical theory. 

Definitions of Outlier 

Many of the researchers who have dealt with the problem of outliers have based their 
work on a subjective definition of an outlier. Dixon (1950) saw an outlier as a value which 
is "dubious in the eyes of the analyst" (p. 488). Grubbs (1969) said an outlier "appears to 
deviate markedly from other members of the sample" (p. 1), and Elashoff and Elashoff 
(1970) stated it is an observation which is "extreme in some sense" (p. 4). Many other
researchers have used the same basic definition (Pascale & Lovas, 1976; Barnett, 1978;
Barnett & Lewis, 1978; Robertson, 1987; Rasmussen, 1988).

By the mid-1970s the definition of an outlier was becoming more complex. Guttman
(1973a) saw an outlier as a spurious observation which did not come from a $N(\mu, \sigma^2)$
population. Gentleman and Wilk (1975b) pointed out that an outlier could be an outlier only
"relative to some prespecified model or theory..." (p. 389), an idea supported by Gentle (in
David, 1978). At this same time Rohlf (1975) referred to "points which are not internal to 
the cloud of points" (p. 93) as potential outliers. Along the same lines, Hawkins (1980) 
defined multivariate outliers as "values with high probabilities of occurring where the 
probability density of the true distribution is low, remote from the main body of data" (p. 
104). Campbell (1980) categorized multivariate outliers as values which "fail to maintain the 
pattern of relationships between the variables evident in the majority of the observations" 
(p.231); Hoaglin, Mosteller, and Tukey (1983) offered a similar definition when they talked 
of "different underlying behavior for certain values as compared with that for the bulk of the
data" (p. 39). 

In the 1980s some authors began to use statistical properties to describe outliers.
Huber (1981) stated that a large $h_i$ is a "warning signal" (p. 161) for an outlier. Anscombe
and Tukey (quoted in Schwager & Margolin, 1982, p. 943) referred to outliers as having 
large residuals, a definition which was supported by Portnoy (1988) when he referred to an 
outlier as any observation "whose residual from some linear model is unusually large
compared to most other residuals from this model" (p. 2). Portnoy also called an 
observation an outlier if it is more than five standard deviations from the model.
Beckman and Cook (1983) stated that the definition of an outlier was still "vague", but that
with the "emphasis on modeling in recent years, 'outlier' now seems to be used ... to
indicate any observation that does not come from the target population" (p. 121). Barnett (in
discussion of Beckman & Cook, 1983) put the idea into perspective when he said, "What is
vital is not whether an arbitrary observation $x_i$ is way out in the tails of F, but whether (for
example) the largest observation $x_{(n)}$ is unreasonably large as an observation of $X_{(n)}$ under F"
(p. 150). Comrey (1985) seemed to regress from an operational definition when he said that 
outliers are "incorrect measurements that contaminate data" (p. 273). 

By the late 1980s researchers were interested in outliers in multiple regression.
Douzenis and Rakow (1987) defined an outlier as a value which is "extremely deviant from 
the regression line" (p. 1). Chatterjee and Hadi (1988) were referring to linear regression 
when they defined an outlier as an "observation for which the studentized residual ($r_i$ or $r_i^*$)
is large in magnitude compared to other observations in the data set" (pp. 94-95). They also 
differentiated between high leverage points and influential points. High leverage points are 
"those for which the input vector is, in some sense far from the rest of the data" as defined 
by Hocking and Pendleton (quoted in Chatterjee & Hadi, 1988, p. 95); influential points are 
defined as "those observations that, individually or collectively, excessively influence the 
fitted regression equation as compared to other observations in the data set" (Chatterjee & 
Hadi, 1988, p. 95). Taylor (1989) said that the above definition of an "influential point" was 
vague, although he described outliers as observations which have an "undue influence on the 
inferences obtained from statistical models" (p. 2), and he went on to state a concern for a
definition of "influence". Simonoff (1989) defined outliers as influential points or leverage
points. 

Booth, Alam, Ahkam, and Osyk (1989) pointed out the difficulty of defining a 
multivariate outlier when they referred to a statistical outlier as a nonrepresentative
observation whose "position may not be extreme enough on the basis of a single variable to 
demonstrate its outlying characteristics. However, the combined effects of several variables 
could be substantial enough to justify categorizing" (p. 321) it as an outlier. 

Rousseeuw and von Zomeren (1990) stated that outliers are an "empirical reality but 
their exact definition is as elusive as the exact definition of a cluster" (p. 650). They 
suggested that outliers are "observations that deviate from the model suggested by the 
majority of the point cloud, where the central model is a multivariate normal" (Rousseeuw 
and von Zomeren, 1990, p. 651). This idea goes back as far as 1975 with Rohlf. 

Outlier Identification 

Identification of outliers is critical because "many of the standard multivariate 
methods are derived under the assumption of normality and the presence of outliers will 
strongly affect inferences made from normal-based procedures" (Schwager & Margolin,
1982, p. 943). 

Some of the procedures for identifying multivariate outliers have been adapted from
the univariate methods developed during the twentieth century. These procedures include the 
generalized studentized residual (Siotani, 1959), the ratio of generalized distance with k 
outlying observations deleted to generalized distance with all observations (Wilks, 1963), the 
W statistic for normality (Shapiro & Wilk, 1965), the examination of the residuals of each 
variable regressed on the other variables (Cox, 1968; Guttman, 1973a), the gap test (Rohlf,
1975), and a Bayesian technique (Guttman, 1973b). Devlin, Gnanadesikan, and Kettenring 
(1975) proposed the use of the sample product-moment correlation coefficient or of scatter 
plots "augmented by influence function contours" (p. 533). Brown (1975) suggested an 
outlier test which analyzes patterns among the signs of the residuals; the presence of outliers 
would disturb the balance of plus and minus signs. 

Many procedures involve the use of residuals. Prescott (1975) proposed a statistic 
using residuals standardized by individual standard deviations. Cook (1977) advocated using 
plots of residuals or examining the standardized residuals or studentized residuals. The 
studentized residual is a measure of the degree to which an observation is an outlier. Cook's 
D is considered a measure of the overall impact any single point has on the least squares 
solution. It is a combined measure of the lack of fit and of the distance in factor space 
according to Wood (1983), but it suffers from masking because it is a sequential process. 
Cook (1986) recommended using as a "basis for detecting cases that should be inspected
for gross errors" (p. 135). Gentle (in David, 1978) suggested that, if there is only one
outlier, the maximum absolute studentized residual, R, can be used to identify the outlier;
however, in the case of multiple outliers, the effectiveness of R suffers. Gentle
recommended several procedures for use in the case of multiple outliers: Andrews' idea to 
project the residual vector onto hyperplanes generated by all combinations of pairs of 
columns, Mickey's forward selection process, and Gentleman and Wilk's regression using 
all subsets of the data with k observations removed. 
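
As a minimal sketch of the residual-based diagnostics described above, the following Python computes the leverage, the studentized residual, and Cook's D for each case in a simple linear regression. The single-predictor model, the use of the internally studentized residual, the function name, and the data values are all assumptions made for illustration; none of them is taken from the studies cited.

```python
import math

def regression_diagnostics(x, y):
    """Return (leverage, studentized residual, Cook's D) per case for y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual mean square
    rows = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx   # diagonal element of the hat matrix
        r = e / math.sqrt(s2 * (1.0 - h))       # internally studentized residual
        d = (r ** 2 / p) * (h / (1.0 - h))      # Cook's D: lack of fit times leverage
        rows.append((h, r, d))
    return rows

# Hypothetical data: the last case lies well above the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 14.0]
rows = regression_diagnostics(x, y)
```

In this made-up sample the last case combines high leverage with a large residual, so it receives by far the largest Cook's D.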



Barnett and Lewis (1978) categorized the available procedures into six types: the 
excess spread statistic, the range spread statistic, the deviation spread statistic, the sums of 
squares statistic, the high-order moment statistic, and the extreme location statistic. The first 
type is represented by Dixon's (1951) ratio involving values; the second is a ratio of range to 
standard deviation proposed by David, Hartley, and Pearson in 1954; the third and fourth 
procedures were presented by Grubbs in 1950. The fifth type, high-order moment statistics,
was developed by Ferguson in 1961, Shapiro and Wilk (1965), and Shapiro, Wilk, and Chen
(1968). The final type was presented by Epstein in 1960 and Likeš (1966).

Other multivariate procedures which are not extensions of univariate methods have 
been developed in the last twenty years. Andrews (1972) suggested a plotting technique 
using a function of the data points. Hawkins (1974) recommended using principal 
components. 

Hoaglin and Welsch (1978) proposed the use of the hat matrix, since the information 
therein can reveal outliers. The diagonal elements can be interpreted as the amount of 
leverage or influence exerted on the predicted y value by $y_i$. Huber (1975) agreed that $h_i$
shows the researcher the points where the value of y has a large impact on the fit. Hoaglin
and Welsch (1978) suggested using the hat matrix with the studentized residual and
"tag(ging) as exceptional any point for which $h_i$ or $r_i^*$ is significant at the 10 percent level" (p.
20). Observations for which $h_i$ is larger than $2p/n$ should be considered suspect.
Rousseeuw and Leroy (1987) found the hat matrix to be susceptible to masking; they also 
pointed out that it is based on the classical covariance matrix which is not robust. 
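
The leverage rule of thumb can be sketched in a few lines of Python. The example below assumes a design matrix with an intercept and a single predictor (so p = 2), writes out the diagonal of the hat matrix from the closed-form inverse of X'X, and flags cases whose leverage exceeds 2p/n. The function name and the data are hypothetical.

```python
def hat_diagonal(x):
    """Diagonal of H = X (X'X)^{-1} X' for the design matrix X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx  # determinant of the 2 x 2 matrix X'X
    # (X'X)^{-1} = (1/det) * [[sxx, -sx], [-sx, n]], so
    # h_i = [1, x_i] (X'X)^{-1} [1, x_i]' = (sxx - 2*sx*x_i + n*x_i^2) / det
    return [(sxx - 2.0 * sx * v + n * v * v) / det for v in x]

# Hypothetical predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
p = 2
h = hat_diagonal(x)
cutoff = 2.0 * p / len(x)
suspect = [i for i, hi in enumerate(h) if hi > cutoff]  # indices of flagged cases
```

Only the remote value at x = 20 exceeds the cutoff here. The leverages sum to p, which is a quick check on the computation.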

Andrews and Pregibon (1978) suggested a linear model which identified deviant or
influential observations by deleting observations, calculating the residual sum of squares, 
calculating the inverse of the inner product matrix formed after deleting observations, and 
forming a ratio. The Andrews-Pregibon statistic is based on the volume of confidence 
ellipsoids (Chatterjee & Hadi, 1988) and is a function of leverage and residual (Fung, 1990). 
Small values of the Andrews-Pregibon statistic are associated with outlying observations. 
Wood (1983) found that this procedure solves the masking problem, but the number of
subsets that need to be examined may be quite large. 
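
A small sketch may help make the ratio concrete. The Python below assumes the single-case form of the statistic, computed as the determinant of the inner product matrix of the augmented matrix Z = [1, x, y] with one case deleted, divided by the same determinant with all cases present. The restriction to one predictor and one deleted case, the helper names, and the data are assumptions for illustration only.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gram(rows):
    """Inner product matrix Z'Z for a list of 3-element rows."""
    return [[sum(r[j] * r[k] for r in rows) for k in range(3)] for j in range(3)]

def andrews_pregibon(x, y):
    """Single-case ratio det(Z_(i)'Z_(i)) / det(Z'Z) for Z = [1, x, y]."""
    z = [(1.0, xi, yi) for xi, yi in zip(x, y)]
    full = det3(gram(z))
    return [det3(gram(z[:i] + z[i + 1:])) / full for i in range(len(z))]

# Hypothetical data: the last case lies well above the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 14.0]
ap = andrews_pregibon(x, y)
most_outlying = min(range(len(ap)), key=ap.__getitem__)  # smallest ratio
```

The smallest ratio belongs to the last case, consistent with small values marking outlying observations. Examining subsets of two or more cases in the same way is what makes the number of required computations grow quickly.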

Jain (1981) considered five recursive procedures for testing the null hypothesis of no
observations in a sample being outliers. The procedures include the extreme studentized
deviate (ESD), the studentized range (STR), kurtosis (KUR), the R-statistic (RST), and the
JST using the interquartile range of the trimmed sample. The first four procedures were 
introduced by Rosner (1975, 1977). 

Schwager and Margolin (1982) focussed on the identification of outliers in a 
multivariate normal sample using a mean slippage model. They determined that the best test 
for outliers is based on Mardia's (1970) multivariate sample kurtosis $b_{2,p}$, which can be used
in an initial screening for outliers. 
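
The kurtosis screen can be illustrated with a short Python sketch. It assumes the usual form of the statistic, the average of the squared Mahalanobis distances raised to the second power, with the covariance matrix computed using divisor n, and it is written for the bivariate case only so that the 2 x 2 inverse can be spelled out. The function name and both data sets are hypothetical.

```python
def mardia_kurtosis_2d(points):
    """Multivariate sample kurtosis b_{2,p} for bivariate data (p = 2)."""
    n = len(points)
    mx = sum(px for px, _ in points) / n
    my = sum(py for _, py in points) / n
    # Covariance elements with divisor n
    sxx = sum((px - mx) ** 2 for px, _ in points) / n
    syy = sum((py - my) ** 2 for _, py in points) / n
    sxy = sum((px - mx) * (py - my) for px, py in points) / n
    det = sxx * syy - sxy ** 2
    total = 0.0
    for px, py in points:
        dx, dy = px - mx, py - my
        # Squared Mahalanobis distance using the explicit 2 x 2 inverse
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        total += d2 ** 2
    return total / n

# Hypothetical data: a 3 x 3 grid, then the same grid plus one remote point.
clean = [(i, j) for i in (-1.0, 0.0, 1.0) for j in (-1.0, 0.0, 1.0)]
contaminated = clean + [(10.0, 10.0)]
b_clean = mardia_kurtosis_2d(clean)
b_contaminated = mardia_kurtosis_2d(contaminated)
```

Adding the remote point raises the statistic, which is the behavior an initial screening relies on. For reference, the statistic is close to p(p + 2) = 8 for large bivariate normal samples.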

Hoaglin, Mosteller, and Tukey (1983) proposed a statistic based on the fourth-spread,
$d_F$. The fourth-spread is the range defined by the upper and the lower fourths. Data points
which are smaller than $F_L - \frac{3}{2} d_F$ or larger than $F_U + \frac{3}{2} d_F$ are considered outliers.
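
The fourth-spread rule is simple enough to sketch directly. The Python below assumes that the fourths are computed as the medians of the lower and upper halves of the ordered sample, with the overall median included in both halves when n is odd; other conventions for the fourths exist and would shift the cutoffs slightly. The function name and the data are hypothetical.

```python
from statistics import median

def fourth_spread_outliers(data):
    """Return the values lying more than 1.5 fourth-spreads beyond the fourths."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2                  # each half includes the median when n is odd
    lower_fourth = median(xs[:half])
    upper_fourth = median(xs[n - half:])
    d_f = upper_fourth - lower_fourth    # the fourth-spread
    low_cut = lower_fourth - 1.5 * d_f
    high_cut = upper_fourth + 1.5 * d_f
    return [v for v in xs if v < low_cut or v > high_cut]

# Hypothetical sample: fourths are 5 and 8, so the cutoffs are 0.5 and 12.5.
sample = [2, 4, 5, 5, 6, 7, 8, 9, 30]
flagged = fourth_spread_outliers(sample)
```

Because the fourths depend only on the middle of the ordered sample, the extreme value 30 does not inflate the cutoffs that are used to judge it.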

The Mahalanobis distance is a measure of the distance in factor space (Wood, 1983; 
Stevens, 1984). It is the most widely used procedure for detection of multivariate outliers. 
Comrey (1985) described the Mahalanobis d-squared as a "multivariate generalization of
using the standard score and the normal curve to determine the probability that a score of a 
specified size or larger will be obtained" (p. 275). Rasmussen (1988) also saw the 
Mahalanobis distance as a multivariate extension of Z. According to Rousseeuw and Leroy 
(1987), the Mahalanobis distance is a measure of leverage. Cook and Hawkins (in discussion 
of Rousseeuw and von Zomeren, 1990) found that the Mahalanobis D gives similar results to
the methods proposed by Rousseeuw and von Zomeren. 
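
To make the idea of a multivariate standard score concrete, the sketch below computes the squared Mahalanobis distance of each case from the sample centroid. It is limited to two variables so that the inverse covariance matrix can be written out, and it uses the classical covariance matrix with divisor n - 1; both choices, the function name, and the data are assumptions for illustration.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each bivariate point from the centroid."""
    n = len(points)
    mx = sum(px for px, _ in points) / n
    my = sum(py for _, py in points) / n
    # Classical (non-robust) covariance elements with divisor n - 1
    sxx = sum((px - mx) ** 2 for px, _ in points) / (n - 1)
    syy = sum((py - my) ** 2 for _, py in points) / (n - 1)
    sxy = sum((px - mx) * (py - my) for px, py in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for px, py in points:
        dx, dy = px - mx, py - my
        out.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

# Hypothetical data: the last point is within the range of each variable
# taken singly but breaks the strong positive relationship between them.
data = [(1.0, 1.1), (2.0, 2.0), (3.0, 2.9), (4.0, 4.2),
        (5.0, 4.9), (6.0, 6.1), (7.0, 7.0), (2.0, 6.0)]
d2 = mahalanobis_sq_2d(data)
farthest = max(range(len(d2)), key=d2.__getitem__)
```

The last point has the largest distance even though neither of its coordinates is extreme on its own, which is the situation a univariate screen would miss.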

Hawkins, Bradu, and Kass (1984) suggested the median tetrad procedure which is not 
susceptible to masking and swamping. It involves the use of elemental sets to obtain
elemental predicted residuals, a combination of tetrad and elemental slope methods. 

The least median of squares and the least trimmed squares are "reliable data analytic 
tools that may be used to discover regression outliers in ... multivariate situations" 
(Rousseeuw and Leroy, 1987, p. 16). After fitting the majority of the data, the outliers are 
the points which lie far away from the robust fit. Portnoy (1988) found that regression
quantile diagnostics work as well as the least median of squares and that Cook's D is
quicker and not much worse.

Graphical procedures suggested by Rousseeuw and Leroy (1987) include the residual
plot, which is useful for detecting outliers, and the standardized LMS residual plotted against
the estimated value of $y$, which "enables the data analyst to detect bad points in a simple
display" (p. 237). Chatterjee and Hadi (1988) also discussed graphical methods, such as the 
frequency distribution of the residuals, plots of the residuals in time sequence, normal or 
half-normal probability plots, plots of the residuals versus the fitted values, plots of the 
residuals versus each predictor $X_j$, added variable plots, components-plus-residuals plots, and
augmented partial residual plots. Booth, Alam, Ahkam & Osyk (1989) suggested using 
principal components analysis consisting of "plots of each data set in the plane of a set's first 
two principal components" (p. 232). 

Hadi (1989b) suggested a procedure for identifying multiple outliers in multivariate 
data. He proposed ordering the data by a robust distance, dividing the data into subsets, 
computing distances for the basic subset containing p + 1 observations, re-ordering 
observations according to the new distances, and repeating these steps until one of three
possible stopping points is reached. He concluded that this procedure is simple,
inexpensive, and effective against both swamping and masking while successful in the
identification of multivariate outliers. 

Simonoff (1989) recommended two approaches: first, do a "robust analysis and 
examine the values which are not in line with the robust fit" or second, "specifically examine 
the data for unusual values" (p. 1). These approaches deal with univariate data; for 
multivariate data one needs "an appropriate test statistic and a method of ordering the data" 
(p. 6). For the test statistic Simonoff recommended using the Mahalanobis D; for ordering 
the multivariate data he recommended using single linkage clustering to avoid masking. 
Clustering methods should work well with outliers, "since an outlier (being unusual) should 
cluster by itself" (p. 7). Hair, Anderson, and Tatham (1987) also recommended clustering, 
both hierarchical and non-hierarchical, as a means of identifying outliers. 

Many of the procedures for detecting multivariate outliers are susceptible to either 
masking, swamping, or both. Huber (1977) discussed the poor performance of some methods
in recognizing outliers if two or more are "bundled together" (p. 3) on the same side of the
sample. Masking refers to the problems encountered when the procedure is unable to 
identify all the outliers (Bradu & Hawkins, 1982; Andrews & Pregibon, 1978). Hadi (1989a)
indicated that masking occurs when an outlying subset goes undetected because of the
presence of another subset. Barnett and Lewis (1978) defined masking as the "tendency for
the presence of extreme observations not declared as outliers to mask the discordancy of
more extreme observations under investigation as outliers" (p. 40). Masking probably becomes
more common as the sample grows, because larger samples tend to contain more outliers
(Beckman & Cook, 1983).
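
A small numerical illustration of masking, offered as a sketch under stated assumptions: the Python below computes the largest absolute deviation from the mean in units of the sample standard deviation (an extreme-studentized-deviate style statistic) for two hypothetical samples of ten values, one with a single discordant value and one with two discordant values on the same side.

```python
from statistics import mean, stdev

def max_studentized_deviate(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    m = mean(values)
    s = stdev(values)
    return max(abs(v - m) for v in values) / s

# Hypothetical samples of size 10
one_outlier = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 20.0]
two_outliers = [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.3, 10.0, 20.0, 20.0]

esd_one = max_studentized_deviate(one_outlier)   # roughly 2.84
esd_two = max_studentized_deviate(two_outliers)  # roughly 1.90
```

The second discordant value pulls the mean toward the pair and inflates the standard deviation, so each of the two values looks less extreme than the single value did. A test applied one observation at a time could therefore fail to declare either of them.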

Swamping is the opposite of masking; instead of declaring too few outliers, the 
procedure declares more outliers than there actually are (Hawkins, Bradu, & Kass, 1984). 
Swamping is a phenomenon whereby legitimate observations appear to be outliers; Hadi 
(1989a) indicated that this occurs "when good observations are incorrectly identified as
outliers because of the presence of another, usually remote, subset of observations" (p. 2). 

Causes of Outliers 

Outliers occur in data for many reasons. Among the most commonly cited reasons
are errors in collecting, recording, coding, or entering data, and deviations from the
experimental design (Seber, 1984; Douzenis & Rakow, 1987; Chatterjee & Hadi, 1988;
Portnoy, 1988); Barnett and Lewis (1978) referred to these as human error and ignorance.
These are the outliers which require identification in order to be corrected or rejected. Some 
outliers occur due to violations of the assumptions; they may indicate the model is not an 
appropriate one for the data, and they will affect the inferences drawn from the procedures
used. Outliers may be due to the "variability inherent in the data" (Grubbs, 1969, p. 1) as
with data from a "heavy tailed distribution such as Student's t" (Hawkins, 1980, p. 1); in this
case, the "outliers" are actually valid data points and should not be deleted. Data may 
actually be from two populations with different distributions, in which case the outliers 
would be observations not from the basic distribution. These outliers should be rejected or 
given small weights (Hawkins, 1980). 

Treatment of Outliers 
After an observation has been identified as an outlier, there remains the question of 
what to do with that observation. Should it be discarded even if it might be a valid data 
point? If it is not discarded, then what should be done to prevent it from having a 
disproportionate effect on the results of the analysis? These questions have been addressed 
since the 18th century with Bernoulli. Various proposals have been made as to the best 
course of action. One of the first courses of action is to correct the observation if it is due to 
an error in recording or coding. If the outlier is not due to a data entry error, there are
basically three alternatives: rejection, accommodation, and incorporation. Barnett and Lewis 
(1978) pointed out the importance of deciding whether the outlier is "important in its own 
right" or whether it is acting "only as an obstruction" (p. 24). Outliers may also lead the 
researcher to an understanding of important factors in the data which he had not previously 
suspected. Beckman and Cook (1983) referred to this possibility as "detecting alternative, 
rare phenomena" (p. 121); they are among several authors who stress that identifying outliers
is more essential than accommodating them. 

Summary

Outliers have been called "one of the most vexing and yet widespread of statistical 
problems" (McCulloch & Meeter, in Beckman & Cook, 1983, p. 152). Many procedures for
identifying outliers have been developed. Most of the procedures function in the following 
ways (Andrews & Pregibon, 1978): They proceed sequentially starting with the most 
aberrant observation, or they proceed without consideration of the influence the "outlier" 
may have on the focus of the analysis. If the outlier does not influence the outcome of the 
analysis, there may be no reason for concern in identifying it. 

One of the major problems in identifying outliers is the lack of agreement on an 
operational definition of an "outlier." Most definitions refer to the outlier as being extreme 
in some manner. The observation may be extreme or outlying in terms of factor space, in 
terms of the residual, in terms of its undue influence, or in terms of its leverage; in short, 
the observation does not appear to be from the same distribution as that of the bulk of the
data. In order to satisfy the majority of the researchers who work with outlier identification 
procedures, an operational definition is a necessity; for if researchers are to be able to accept 
a procedure or even a small number of procedures for identifying outliers, they must first
agree on a definition for "outlier." Once a consensus has been reached on a definition, then 
researchers can proceed in perfecting outlier identification procedures. An accepted 
definition may also facilitate the decision as to what to do with outliers after identification. 
The treatment of outliers is contingent on the type of data being studied. If the data consist 
of variables pertaining to schools, school systems, or teachers, for example, the outliers
would be those which represent the most effective and the least effective in the data set; in
such a case the identification of the outliers might be the main focus of the study.